Obesity is more than a health condition characterised by excessive fat accumulation: it is a complex, multifactorial public health crisis that has been escalating globally. According to the World Health Organisation, its prevalence has nearly tripled since 1975, affecting millions of individuals across all age brackets and making it one of the most visible, yet most neglected, public health challenges of the 21st century (1). Individuals with obesity are twice as likely to develop hypertension and five times as likely to develop type 2 diabetes, and they have an increased risk of cancers such as colon cancer and of premature mortality. This places a substantial strain on healthcare systems (2).
In the United Kingdom, the National Health Service (NHS) spends approximately £6 billion annually treating the consequences of obesity, and this cost is forecast to increase to £10 billion a year by 2050 (3). In addition to direct healthcare costs, obesity incurs significant indirect costs to the economy, including reduced workforce productivity and increased social care use, estimated at around £7.5 billion a year (4).
The complexity of obesity arises from its multifactorial aetiology, encompassing behavioural, environmental, genetic, and socioeconomic influences (5). Within this context, lifestyle choices - specifically, physical activity levels, smoking status and alcohol intake - stand out as factors whose role in different levels of obesity merits exploration (6). Understanding the interplay of these factors is crucial for developing and implementing tailored public health policies and strategies that can address the growing epidemic at its many levels (3, 7). Addressing obesity matters not only for health but also for economic reasons, as reducing obesity-related conditions can lead to significant cost savings and improvements in quality of life and workforce productivity.
Thus, this research aims to analyse the complex relationships between physical activity, smoking, alcohol consumption and different levels of obesity.
Research Question
Data Source
The dataset used in this study contains information about estimated obesity levels in individuals from three countries based on family history, dietary measurements, physical activities and social behaviours (including drinking alcohol and smoking).
Variables
The dataset contains 18 attributes related to individual health, lifestyle choices, and obesity classification. Each variable is described below.
This study focuses on three lifestyle variables: smoking status (attribute 11), physical activity frequency (attribute 14) and alcohol consumption (attribute 16). The aim is to visualise the relationship between these variables and different types of obesity.
Sample Size
The dataset comprises a sample of 2,111 individuals, with each row representing a unique participant aged between 14 and 61 years.
An initial data exploration was performed to understand the data and assess the data quality:
1. Completeness
The dataset was checked for missing values so that any found could be quantified and addressed using appropriate methods (such as deletion or imputation), depending on the pattern and context of missingness (8). The function sum(is.na(obesity_data)) was used to count missing values and revealed that there were none in the dataset.
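The completeness check can be sketched in base R as follows. This is a minimal illustration on a made-up toy data frame (the values are not from the study); only the call sum(is.na(obesity_data)) is taken from the text, and the per-column breakdown is an added convenience.

```r
# Toy stand-in for the study's data frame; all values are invented for illustration.
obesity_data <- data.frame(
  Age    = c(21, 23, NA, 35),
  Weight = c(64.0, 56.5, 77.2, NA),
  SMOKE  = c("no", "yes", "no", "no")
)

# Total number of missing cells (the check quoted in the text).
total_missing <- sum(is.na(obesity_data))

# Per-column counts, which help when choosing between deletion and imputation.
missing_by_column <- colSums(is.na(obesity_data))

print(total_missing)
print(missing_by_column)
```

On the real dataset the text reports a total of zero, in which case no deletion or imputation step is needed.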
2. Consistency and Validity
Variables were checked for consistent formats and appropriate units and ranges, as inconsistencies could introduce bias and lead to inaccurate results and visualisations (8, 9). Inconsistencies were observed in Age, Height and Weight, and in categorical variables such as the number of daily meals. These were corrected by rounding height and weight to two decimal places and categorising Age into groups. To correct mismatched variable encoding, categorical variables (including gender, family history of overweight and frequency of high caloric diet) that were encoded as characters were converted to factors (8, 9).
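A minimal base R sketch of these corrections is given below. The toy values and, in particular, the age-band breakpoints are assumptions made for illustration; the study does not state which age groups were used.

```r
# Toy data; values are invented for illustration.
obesity_data <- data.frame(
  Gender = c("Female", "Male", "Male"),
  Age    = c(19.63, 27.93, 41.2),
  Height = c(1.6234, 1.7581, 1.8049),
  Weight = c(58.004, 89.9512, 108.456),
  stringsAsFactors = FALSE
)

# Round height and weight to two decimal places.
obesity_data$Height <- round(obesity_data$Height, 2)
obesity_data$Weight <- round(obesity_data$Weight, 2)

# Categorise age into groups. The breakpoints below are hypothetical.
obesity_data$AgeGroup <- cut(
  obesity_data$Age,
  breaks = c(14, 18, 35, 50, 61),
  right = FALSE,          # intervals are closed on the left: [18, 35)
  include.lowest = TRUE   # the last interval also includes 61
)

# Convert every character column to a factor.
char_cols <- vapply(obesity_data, is.character, logical(1))
obesity_data[char_cols] <- lapply(obesity_data[char_cols], factor)

str(obesity_data)
```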
3. Uniqueness
The dataset was checked for duplicate entries, as these can distort the data being visualised, producing misleading visuals and incorrect conclusions (9). The function duplicated() revealed that there were no duplicate rows in the dataset.
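The uniqueness check can be illustrated with a short base R sketch. The toy data frame below is invented and deliberately contains one repeated row so that the check has something to find.

```r
# Toy data with one exact duplicate row (rows 1 and 2); values are invented.
obesity_data <- data.frame(
  Age    = c(21, 21, 30),
  Weight = c(64, 64, 80),
  SMOKE  = c("no", "no", "yes")
)

# duplicated() flags each row that repeats an earlier row.
n_duplicates <- sum(duplicated(obesity_data))

# If duplicates were found, they could be dropped like this.
deduplicated <- obesity_data[!duplicated(obesity_data), ]

print(n_duplicates)
print(nrow(deduplicated))
```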
4. Outliers
Outliers in the continuous variables (age, weight, height) were identified using a statistical method (the interquartile range) so that they could be addressed according to their nature and the analysis scope of this study (9). Height and weight each had a single outlier, which on further evaluation did not appear to influence data quality significantly. Age had 164 reported outliers. However, this reflected the age distribution of the dataset, in which 1,827 of the 2,111 participants were aged 18-35 years, indicating that the dataset was skewed with respect to participant age.
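One common form of the interquartile range method flags values lying more than 1.5 times the IQR beyond the first or third quartile. The sketch below assumes that conventional multiplier (the text does not state which one was used) and runs on invented ages; iqr_outliers is a hypothetical helper name.

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR].
# k = 1.5 is the conventional choice and an assumption here.
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

# Invented ages, concentrated in young adulthood with one older participant.
ages <- c(19, 20, 21, 22, 23, 24, 25, 26, 61)

flags <- iqr_outliers(ages)
print(ages[flags])
```

In a sample concentrated in a narrow age band, as here, older participants are flagged even though their ages are valid, which is consistent with the decision to treat the Age outliers as a feature of the distribution rather than as errors.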
Research Question
Table 1. Demographic Summary Table
| Characteristics | N | Female, N = 1,043 | Male, N = 1,068 | p-value |
|---|---|---|---|---|
| Age | 2,111 | | | 0.002 |
| Mean (SD) | | 24.00 (6.41) | 24.62 (6.27) | |
| Median [IQR] | | 22.00 [19.63, 26.00] | 23.00 [20.00, 27.93] | |
| Height | 2,111 | | | <0.001 |
| Mean (SD) | | 1.64 (0.07) | 1.76 (0.07) | |
| Median [IQR] | | 1.64 [1.60, 1.70] | 1.76 [1.71, 1.81] | |
| Weight | 2,111 | | | <0.001 |
| Mean (SD) | | 82.30 (29.72) | 90.77 (21.41) | |
| Median [IQR] | | 78.00 [58.00, 105.04] | 89.95 [75.00, 108.46] | |
| Family History of Overweight | 2,111 | | | <0.001 |
| no | | 232 (22%) | 153 (14%) | |
| yes | | 811 (78%) | 915 (86%) | |
| FAVC | 2,111 | | | 0.003 |
| no | | 143 (14%) | 102 (9.6%) | |
| yes | | 900 (86%) | 966 (90%) | |
| FCVC | 2,111 | | | <0.001 |
| 1 | | 49 (4.7%) | 53 (5.0%) | |
| 2 | | 342 (33%) | 671 (63%) | |
| 3 | | 652 (63%) | 344 (32%) | |
| NCP | 2,111 | | | <0.001 |
| 1 | | 194 (19%) | 122 (11%) | |
| 2 | | 55 (5.3%) | 121 (11%) | |
| 3 | | 794 (76%) | 825 (77%) | |
| CAEC | 2,111 | | | <0.001 |
| Always | | 23 (2.2%) | 30 (2.8%) | |
| Frequently | | 161 (15%) | 81 (7.6%) | |
| no | | 15 (1.4%) | 36 (3.4%) | |
| Sometimes | | 844 (81%) | 921 (86%) | |
| SMOKE | 2,111 | | | 0.040 |
| no | | 1,028 (99%) | 1,039 (97%) | |
| yes | | 15 (1.4%) | 29 (2.7%) | |
| CH2O | 2,111 | | | <0.001 |
| 1 | | 297 (28%) | 188 (18%) | |
| 2 | | 489 (47%) | 621 (58%) | |
| 3 | | 257 (25%) | 259 (24%) | |
| SCC | 2,111 | | | <0.001 |
| no | | 973 (93%) | 1,042 (98%) | |
| yes | | 70 (6.7%) | 26 (2.4%) | |
| FAF | 2,111 | | | <0.001 |
| 0 | | 475 (46%) | 245 (23%) | |
| 1 | | 305 (29%) | 471 (44%) | |
| 2 | | 226 (22%) | 270 (25%) | |
| 3 | | 37 (3.5%) | 82 (7.7%) | |
| TUE | 2,111 | | | <0.001 |
| 0 | | 450 (43%) | 502 (47%) | |
| 1 | | 493 (47%) | 422 (40%) | |
| 2 | | 100 (9.6%) | 144 (13%) | |
| CALC | 2,111 | | | 0.12 |
| Always | | 0 (0%) | 1 (<0.1%) | |
| Frequently | | 28 (2.7%) | 42 (3.9%) | |
| no | | 304 (29%) | 335 (31%) | |
| Sometimes | | 711 (68%) | 690 (65%) | |
| MTRANS | 2,111 | | | <0.001 |
| Automobile | | 166 (16%) | 291 (27%) | |
| Bike | | 0 (0%) | 7 (0.7%) | |
| Motorbike | | 2 (0.2%) | 9 (0.8%) | |
| Public_Transportation | | 854 (82%) | 726 (68%) | |
| Walking | | 21 (2.0%) | 35 (3.3%) | |
| Obesity Types | 2,111 | | | <0.001 |
| Insufficient_Weight | | 173 (17%) | 99 (9.3%) | |
| Normal_Weight | | 141 (14%) | 146 (14%) | |
| Overweight_Level_I | | 145 (14%) | 145 (14%) | |
| Overweight_Level_II | | 103 (9.9%) | 187 (18%) | |
| Obesity_Type_I | | 156 (15%) | 195 (18%) | |
| Obesity_Type_II | | 2 (0.2%) | 295 (28%) | |
| Obesity_Type_III | | 323 (31%) | 1 (<0.1%) | |
FAVC (High Caloric Food Consumption, no/yes), FCVC (Vegetable Intake, scale: Never (1), Sometimes (2), Always (3)), NCP (Main Meals Daily, scale: between 1 and 2 (1), three (2), more than three (3)), CAEC (Snacking between meals), CH2O (Water Consumption Daily, scale: <1L (1), 1-2L (2), >2L (3)), SCC (Calorie Monitoring), FAF (Physical Activity Frequency, scale: none (0), 1 or 2 days (1), 2 or 3 days (2), 4 or 5 days (3)), TUE (Tech Use Duration, scale: 0-2 hours (0), 3-5 hours (1), more than 5 hours (2)), CALC (Alcohol Consumption), MTRANS (Usual Transport Method), NObeyesdad (Obesity Categories).
The Visualisation
The summary table gives an overall view of the data and shows that it largely comprises individuals aged 18-35 years. This limits the generalisability of findings derived from the dataset, which may not be applicable to the wider population. Likewise, obesity types II and III showed a major imbalance between male and female participants: obesity type II included only 2 female participants compared with 295 male participants, while obesity type III included 1 male participant and 323 female participants. This imbalance may point to data quality issues during the data collection process.
Justification of Design Choice
A summary table provides a clear, high-level overview of the data and is effective for a dataset such as this one, whose many variables would be challenging to represent in a single bar plot, heat map or correlation plot.
The summary table also allows comparison across demographic characteristics, as shown in Table 1. This feature is particularly useful for health data, where such stratification can clarify health disparities or outcomes. However, while the summary table gives a clear overview of the data, it is less effective than charts or graphs at revealing patterns, trends or outliers. Nevertheless, a summary table appears to be the best option for presenting a comprehensive overview of the data in a way that is statistically detailed and accessible.
In addition, considering that the target audience of this study includes policymakers and healthcare managers, the table provides a straightforward presentation of figures and is an effective way of communicating the data to such an audience (10, 11).
Accessibility Considerations and Visualisation Principles
To improve readability for individuals with visual impairments, the font size was set to 16. Bold labels were used to aid visual distinction of headings and improve scannability (10).
The table width was adjusted to prevent the data stretching across the screen, aiding readability (10). Accessibility was further improved by using scroll_box to ensure ease of navigation for users. kable_styling with bootstrap_options was also used to improve readability and navigation.
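The styling choices described above could be combined roughly as follows. This is a sketch under assumptions, not the study's code: it assumes the knitr and kableExtra packages are installed and that the table is rendered as HTML. The specific bootstrap options and scroll box dimensions are illustrative, and summary_df is a hypothetical two-row excerpt of Table 1.

```r
library(knitr)
library(kableExtra)

# Small excerpt of Table 1 (SMOKE rows) used as example content.
summary_df <- data.frame(
  Characteristic = c("SMOKE: no", "SMOKE: yes"),
  Female = c("1,028 (99%)", "15 (1.4%)"),
  Male   = c("1,039 (97%)", "29 (2.7%)")
)

tbl <- kable(summary_df, format = "html")

# Font size 16 and a table that does not stretch across the full screen width.
# The bootstrap options chosen here are assumptions.
tbl <- kable_styling(
  tbl,
  bootstrap_options = c("striped", "hover"),
  full_width = FALSE,
  font_size = 16
)

# Bold header row (row 0) to make headings easier to scan.
tbl <- row_spec(tbl, 0, bold = TRUE)

# Scrollable container; the dimensions are illustrative.
tbl <- scroll_box(tbl, width = "100%", height = "400px")
```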
For clear labelling, columns were renamed to be more descriptive and easier to understand. For effective data representation, the mean (SD) and median [IQR] were used to describe continuous variables, while counts and percentages were used for categorical variables.
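The two kinds of summary can be computed in base R as sketched below. The toy data are invented, and fmt_continuous is a hypothetical helper. The p-values shown in Table 1 are not reproduced here, because the text does not state which tests produced them.

```r
# Toy data; values are invented for illustration.
toy <- data.frame(
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  Age    = c(19, 22, 26, 20, 23, 28),
  SMOKE  = c("no", "no", "yes", "no", "yes", "no")
)

# Continuous variable: "Mean (SD)" and "Median [IQR]" strings.
fmt_continuous <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75))
  c(
    mean_sd    = sprintf("%.2f (%.2f)", mean(x), sd(x)),
    median_iqr = sprintf("%.2f [%.2f, %.2f]", median(x), q[[1]], q[[2]])
  )
}
age_summary <- sapply(split(toy$Age, toy$Gender), fmt_continuous)

# Categorical variable: counts and column-wise percentages by gender.
counts   <- table(toy$SMOKE, toy$Gender)
percents <- round(100 * prop.table(counts, margin = 2), 1)

print(age_summary)
print(counts)
print(percents)
```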
The Visualisation
Faceting by obesity levels allows for a direct comparison of physical activity, smoking status, and alcohol consumption across the three obesity types and addresses the research question. To achieve this, aesthetic mappings were used to categorise and differentiate the levels of physical activity, smoking status, and alcohol consumption across obesity categories.
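A faceted bar chart of this kind could be built with ggplot2 roughly as follows. This is a sketch only: the counts are invented placeholders rather than study results, and the column names obesity_level, FAF and n are assumptions about how the summarised data might be laid out.

```r
library(ggplot2)

# Invented counts for illustration only; these are not the study's figures.
toy <- data.frame(
  obesity_level = rep(c("Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"),
                      each = 4),
  FAF = factor(rep(0:3, times = 3)),
  n   = c(40, 30, 20, 10, 45, 35, 15, 5, 50, 30, 15, 5)
)

# x and y are aesthetic mappings; facet_wrap() draws one panel per obesity type.
p <- ggplot(toy, aes(x = FAF, y = n)) +
  geom_col(fill = "#1F4E79") +
  facet_wrap(~ obesity_level) +
  labs(
    title = "Physical activity frequency by obesity type (toy data)",
    x = "Physical activity frequency (FAF)",
    y = "Number of participants"
  )
```

Because every panel shares the same axes by default, bar heights can be compared directly across obesity types.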
Visualisation 3, however, revealed a major imbalance between smokers and non-smokers across the three obesity categories. Further exploration revealed a similar imbalance across the entire dataset, with ~98% of participants responding ‘no’ to the question on smoking status. Table 1 shows the proportion of male and female smokers and non-smokers within the data.
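The ~98% figure can be checked directly against the SMOKE counts in Table 1, as in this short base R calculation.

```r
# SMOKE counts taken from Table 1.
smokers     <- c(Female = 15, Male = 29)
non_smokers <- c(Female = 1028, Male = 1039)

total <- sum(smokers) + sum(non_smokers)

# Share of all participants who answered "no" to smoking.
share_non_smokers <- sum(non_smokers) / total

print(total)
print(round(100 * share_non_smokers, 1))
```

With only 44 smokers in total, any breakdown of smokers by obesity category rests on very small cell counts.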
Justification of Design Choice
Considering that the variables (physical activity, smoking status, alcohol consumption and obesity levels) are categorical, a bar chart displays the association in a simple, effective and intuitive way, allowing the audience to quickly understand the data (10).
An alternative approach would be a mosaic plot or grouped bar chart, which is more compact; however, these may present an accessibility challenge, as they may not be as intuitive as a simple bar chart to the general population.
In visualising alcohol consumption (visualisation 3), the ggplot object was converted into an interactive chart using plotly, which provides additional information through tooltips and enhances accessibility. Axis labels and descriptive titles were included to aid user understanding. All bar axes start at zero to avoid presenting misleading visuals (10).
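The conversion to an interactive chart could look roughly like this. The sketch assumes the ggplot2 and plotly packages are installed; the counts are invented, and the tooltip wording is an illustrative choice rather than the study's.

```r
library(ggplot2)
library(plotly)

# Invented counts for illustration only.
toy <- data.frame(
  CALC = c("no", "Sometimes", "Frequently"),
  n    = c(30, 65, 5)
)

# The `text` aesthetic is not drawn by ggplot2; ggplotly() uses it for tooltips.
p <- ggplot(toy, aes(x = CALC, y = n, text = paste0(CALC, ": ", n))) +
  geom_col(fill = "#1F4E79") +
  scale_y_continuous(limits = c(0, NA)) +  # y-axis starts at zero
  labs(
    title = "Alcohol consumption (toy data)",
    x = "Alcohol consumption (CALC)",
    y = "Number of participants"
  )

# Convert the static chart into an interactive one with custom tooltips.
interactive_p <- ggplotly(p, tooltip = "text")
```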
Accessibility Considerations and Visualisation Principles
As a colourblind-friendly approach for presenting the visualisation
and to avoid reliance on colour differentiation, a single fill colour
(#1F4E79) was used for the bar chart showing the
association between physical activity and obesity (10).
To improve readability, suitable font sizes were used for the axis and title texts. For ease of understanding, labels were added directly to the bars using geom_text(), which lets users read the values without cross-referencing the axis (10).
For visualisation 2, which has clustered labels on the x-axis, the label orientation was set to an angle of 45° for improved legibility, as opposed to using horizontal labels, and grid lines were omitted to help focus attention on the data itself (10).
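The accessibility choices in this section can be combined in one ggplot2 sketch. The counts are the female and male totals from Table 1 summed for three of the categories; the base font size, label offset and titles are assumptions made for illustration.

```r
library(ggplot2)

# Totals from Table 1 (female + male) for three categories.
toy <- data.frame(
  level = c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I"),
  n     = c(272, 287, 290)
)

p <- ggplot(toy, aes(x = level, y = n)) +
  geom_col(fill = "#1F4E79") +                  # single fill, no reliance on colour
  geom_text(aes(label = n), vjust = -0.5) +     # value labels placed above each bar
  labs(
    title = "Participants per category",
    x = "Obesity category",
    y = "Number of participants"
  ) +
  theme_minimal(base_size = 14) +               # font size is an assumption
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # angled x-axis labels
    panel.grid  = element_blank()                       # grid lines omitted
  )
```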
Summary of Findings
The charts reveal a trend in which higher levels of obesity are associated with lower levels of physical activity and higher alcohol consumption. The chart on smoking status revealed a significant imbalance in the data, which would require further clarification.
Implications for Policy and Public Health
The observed trends in physical activity and alcohol consumption reveal potential areas for public health intervention. The inverse relationship between physical activity and obesity types suggests a need for policies that promote physical activity, especially for individuals at higher obesity levels, and as a preventive measure for individuals in overweight categories. Likewise, alcohol reduction interventions could be beneficial considering the positive association between alcohol consumption and obesity levels.
Limitations
Some limitations associated with this work include:
1. The self-reported nature of the data: This could be a source of bias in the reporting of physical activity, smoking and alcohol consumption.
2. Representativeness: The data may not be representative of the wider population, implying that findings from this data may not be generalisable.
Future Work
For future work, statistical analysis of combinations of the variables may be beneficial in exploring this topic further.